Abstract. A run in a word is a periodic factor whose length is at least twice its period
and which cannot be extended to the left or right without increasing the period. In
recent years a great deal of work has been done on estimating the maximum number
of runs that can occur in a word of length n. A number of associated problems have
also been investigated. In this paper we consider a new variation on the theme. We
say that the total run length (TRL) of a word is the sum of the lengths of the runs in
the word and that T(n) is the maximum TRL over all words of length n. We show that
n^2/8 < T(n) < 47n^2/72 + 2n for all n. We also give a formula for the average total run
length of words of length n over an alphabet of size α, and some other results.
1. Introduction

We use standard notation for combinatorics on words. A word of n elements is x = x[1..n],
with x[i] being the ith element and x[i..j] the factor of elements from position i to
position j. If i = 1 then the factor is a prefix and if j = n it is a suffix. The letters in
x come from some alphabet A. The length of x, written |x|, is the number of letters x
contains and the number of occurrences of a letter a in x is denoted by |x|_a. Two or
more adjacent identical factors form a so-called power. A word which is not a power
is said to be primitive. A word x or factor x is periodic with period p if x[i] = x[i + p]
for all i such that x[i] and x[i + p] are in x. A periodic word with least period p and
length n is said to have exponent n/p. For example, the word ababa has exponent 5/2
and can be written as (ab)^{5/2}. If x = x[1..n] then the reverse of x, written R(x), is
x[n]x[n − 1] ··· x[1]. A word that equals its own reverse is called a palindrome.
In this paper we are concerned with runs. A run (or maximal periodicity) in a word
x is a factor x[i..j] having minimum period p, length at least 2p and such that neither
x[i − 1..j] nor x[i..j + 1] is a factor with period p. Runs are important because
of their applications in data compression and computational biology (see, for example,
[?]). In recent years a number of papers have appeared concerning the function
ρ(n) which is the maximum number of runs that can occur in a word of length n.
In 2000 Kolpakov and Kucherov [9] showed that ρ(n) = O(n) but their method did
not give any information about the size of the implied constant. They conjectured that
ρ(n) < n for all n, which has become known as the Runs Conjecture. In [18] Rytter
showed that ρ(n) < 5n. This bound was improved progressively in [17] and [3] and
most recently by Crochemore, Ilie and Tinta [4] to 1.029n. Their method is difficult and
heavily computational. Giraud [8] has produced weaker results using a much simpler
technique. He also showed that lim_{n→∞} ρ(n)/n exists. In the other direction Franek et
al. showed that this limit is greater than 0.927, a result that was improved by Kusano
et al. [10] and Simpson [?] to 0.944. We therefore have

    0.944 < lim_{n→∞} ρ(n)/n < 1.029.

These investigations have prompted authors to investigate a number of associated
problems. Baturo and coauthors looked at runs in Sturmian words [2]. Puglisi and
Simpson [16] gave formulas for the expected number of runs in a word of length n.
This depends on the alphabet size, with binary alphabets giving the highest values.
Kusano and Shinohara [12] obtained similar results for necklaces (words with their
ends joined). Crochemore [5] and others have investigated runs whose length is at
least three times the period. Rather than counting the number of runs one can consider
the sum of the exponents of the runs. The word ababaaba has runs a^2, (ab)^{5/2} and
(aba)^2, so it contains 3 runs with sum of exponents 13/2. Let e(n) be the maximum
sum of the exponents of runs in a word of length n. It is known [5] that for large n

    2.035n < e(n) < 4.1n.

In this paper we introduce a new variation on this theme. The total run length of a
word is the sum of the lengths of the runs in the word. The word ababaaba, for instance,
contains runs aa, ababa and abaaba of lengths 2, 5 and 6 so its total run length is 13. We
write TRL(w) for the total run length of a word w and T(n) for max{TRL(w) : |w| = n}. In
the next section we give some minor results about total run length (TRL for short) and
obtain a lower bound on T(n). An upper bound is given in Section 3 and a formula for the
expected TRL in Section 4. In the final section we discuss our results and suggest some
areas for further research.
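
The definitions of run and total run length can be checked mechanically on short words. The following is a minimal brute-force sketch, not an efficient algorithm and not a method taken from the literature cited above; the names has_period, runs and trl are hypothetical. For each candidate period p it collects the maximal stretches on which w[i] = w[i + p] and keeps those factors of length at least 2p whose minimum period is p.

```python
def has_period(f, q):
    """True if f[j] == f[j + q] wherever both positions exist."""
    return all(f[j] == f[j + q] for j in range(len(f) - q))


def runs(w):
    """Return the runs of w as (start, end, period), 0-indexed, end exclusive."""
    n, found = len(w), []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            s = i
            # Extend the stretch of matches w[i] == w[i + p] as far as possible.
            while i + p < n and w[i] == w[i + p]:
                i += 1
            factor = w[s:i + p]  # maximal factor with period p
            # Keep it only if it is long enough and p is its minimum period.
            if len(factor) >= 2 * p and not any(
                has_period(factor, q) for q in range(1, p)
            ):
                found.append((s, i + p, p))
    return found


def trl(w):
    """Total run length: the sum of the lengths of all runs of w."""
    return sum(end - start for start, end, _ in runs(w))


if __name__ == "__main__":
    for start, end, p in runs("ababaaba"):
        print("ababaaba"[start:end], "period", p)
    print(trl("ababaaba"))  # 13
```

The scan costs roughly cubic time in the worst case, which is adequate for the word lengths considered in this paper but not for long inputs.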

2. A lower bound on T(n)

Table 1 (below) gives T(n) for low values of n under the assumption that these are 
attained by binary words, and examples of words that attain these bounds. 

Table 1. Values of T(n) assuming these are obtained by binary words.

 n    T(n)   T(n)/n^2   Example
 1      0    0          a
 2      2    0.5        aa
 3      3    0.333      aaa
 4      4    0.250      aaaa
 5      6    0.240      aabab
 6     10    0.278      aabaab
 7     12    0.245      aabaabb
 8     16    0.250      aabbaabb
 9     19    0.235      abaaabaab
10     29    0.290      aababaabab
11     32    0.264      abaababaaba
12     37    0.257      abaababaabab
13     42    0.249      ababbababbaba
14     47    0.240      aaabaabaaabaab
15     53    0.236      abaabababaababa
16     60    0.234      aabaababaabaabab
17     70    0.242      ababaabababaababa
18     73    0.225      aababaabababaababa
19     80    0.222      abaababaabaababaaba
20     85    0.212      abaababaabaababaabab
21     92    0.209      ababaababababaabababa
22     99    0.205      aababaababaaababaababa

We do not know whether binary words are best, though this seems likely. The same
uncertainty exists for ρ(n) and is discussed in [?]. If a word is optimal then so is its
reverse and its complement (formed by interchanging a and b). In most cases the words
attaining the bounds in the table are unique up to these transformations. Note that in
many cases these words are palindromes.
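
The first few rows of Table 1 can be reproduced by exhaustive search. The sketch below makes the same assumption as the table, namely that only binary words are searched; max_trl is a hypothetical helper name, and the run finder is a plain brute-force scan. The search visits 2^n words, so it is only practical for small n.

```python
from itertools import product


def has_period(f, q):
    """True if f[j] == f[j + q] wherever both positions exist."""
    return all(f[j] == f[j + q] for j in range(len(f) - q))


def trl(w):
    """Total run length of w, found by scanning every candidate period."""
    total, n = 0, len(w)
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            s = i
            while i + p < n and w[i] == w[i + p]:
                i += 1
            f = w[s:i + p]
            if len(f) >= 2 * p and not any(has_period(f, q) for q in range(1, p)):
                total += len(f)
    return total


def max_trl(n):
    """Maximum TRL over all binary words of length n (exhaustive search)."""
    return max(trl("".join(t)) for t in product("ab", repeat=n))


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, max_trl(n))
```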

If we use an alphabet of size greater than 2 we can construct words of any length
containing no runs (see Section 3.1.2 of [?]) and so having TRL equal to 0. For binary
words the minimum values of TRL(w) for w of length up to 5 are 0, 0, 0, 2, 2, and for
length 6 the minimum is 3 (attained, for example, by baaaba). For larger values of n we
have the following.

Theorem 1. The minimum value of TRL(w) for binary words of length n, n ≥ 7, is n − 4,
and is attained by the word aba^{n−4}ba.

Proof. For n ≥ 7 the only run in the given word is a^{n−4}, so it has TRL equal to n − 4.
(For n = 6 the word abaaba is itself a square, which is why that length is excluded.) We
must show no binary word of length n has a lower TRL. Suppose the word w does. Then it
has TRL(w) ≤ n − 5 and therefore there are at least 5 letters in w which do not belong
to any run. Consider the middle of these 5, and without loss of generality suppose it is
a. Its neighbours cannot equal a as then it would belong to a run. Therefore it is the
central a of some factor ab^{k_1}ab^{k_2}a. If k_1 ≤ k_2 then this factor has the square
ab^{k_1}ab^{k_1} as a prefix, so the central a does belong to a run, contradicting the
hypothesis. If k_1 > k_2 an analogous argument applies.

□ 
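
The minimum in Theorem 1 can be checked by exhaustive search for small n. This is a sketch only: min_trl is a hypothetical helper, the run finder is a brute-force scan, and the search is exponential in n. It also evaluates the word aba^{n−4}ba directly; for n = 6 that word is the square abaaba, whose TRL is 8 rather than 2, while for n ≥ 7 its only run is a^{n−4}.

```python
from itertools import product


def has_period(f, q):
    """True if f[j] == f[j + q] wherever both positions exist."""
    return all(f[j] == f[j + q] for j in range(len(f) - q))


def trl(w):
    """Total run length of w, found by scanning every candidate period."""
    total, n = 0, len(w)
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            s = i
            while i + p < n and w[i] == w[i + p]:
                i += 1
            f = w[s:i + p]
            if len(f) >= 2 * p and not any(has_period(f, q) for q in range(1, p)):
                total += len(f)
    return total


def min_trl(n):
    """Minimum TRL over all binary words of length n (exhaustive search)."""
    return min(trl("".join(t)) for t in product("ab", repeat=n))


def witness(n):
    """The word a b a^(n-4) b a from Theorem 1."""
    return "ab" + "a" * (n - 4) + "ba"


if __name__ == "__main__":
    for n in range(6, 13):
        print(n, min_trl(n), trl(witness(n)))
```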

Theorem 2. For n > 1 we have T(n) > n^2/8.

Proof. From Table 1 we see this holds for 2 ≤ n ≤ 9. For k ≥ 2 let u(k) = ((ab)^k a)^2
and write m for its length. We find

    m = |u(k)| = 4k + 2

and

    TRL(u(k)) = 2k^2 + 8k + 4
              = (m^2 + 12m + 4)/8

so T(m) ≥ (m^2 + 12m + 4)/8 whenever m ≥ 10 and m ≡ 2 (mod 4). Appending a letter to a
word cannot decrease its TRL, so T is non-decreasing. Every n ≥ 10 satisfies
n − 3 ≤ m ≤ n for some such m, and therefore

    T(n) ≥ T(m) ≥ ((n − 3)^2 + 12(n − 3) + 4)/8 = (n^2 + 6n − 23)/8 > n^2/8,

so the bound holds in every case.

□ 
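
The run structure of u(k) = ((ab)^k a)^2 can be confirmed numerically. For k ≥ 2 its runs are the two copies of (ab)^k a (period 2), the central aa, and one square of length 4j + 2 with period 2j + 1 centred on the middle of the word for each j = 1, ..., k, which gives 2k^2 + 8k + 4 in total. The sketch below uses a brute-force run finder; u is a hypothetical helper name.

```python
def has_period(f, q):
    """True if f[j] == f[j + q] wherever both positions exist."""
    return all(f[j] == f[j + q] for j in range(len(f) - q))


def trl(w):
    """Total run length of w, found by scanning every candidate period."""
    total, n = 0, len(w)
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            s = i
            while i + p < n and w[i] == w[i + p]:
                i += 1
            f = w[s:i + p]
            if len(f) >= 2 * p and not any(has_period(f, q) for q in range(1, p)):
                total += len(f)
    return total


def u(k):
    """The word ((ab)^k a)^2 of length 4k + 2."""
    return ("ab" * k + "a") * 2


if __name__ == "__main__":
    for k in range(2, 9):
        m = 4 * k + 2
        # Columns: k, length, measured TRL, 2k^2 + 8k + 4, and the bound m^2/8.
        print(k, m, trl(u(k)), 2 * k * k + 8 * k + 4, m * m / 8)
```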

3. An upper bound for T(n) 
We first assemble some lemmas. 

Lemma 3. [?] (The Periodicity Lemma) If x is a word having two periods p and q and
|x| ≥ p + q − gcd(p, q) then x also has period gcd(p, q).

Lemma 4. (Lemma 8.1.1 of [14]) Let a be a word having two periods p and q with q < p.
Then the suffix and prefix of length |a| − q both have period p − q.

Lemma 5. (Lemma 8.1.2 of [14]) Let a, b and c be words such that ab and bc have period p
and |b| ≥ p. Then the word abc has period p.

Lemma 6. If w is a word for which w[1..2p] has period p and w[k + 1..k + 2p + 2] has period
p + 1, where 0 ≤ k ≤ p, then w has the form:

(1)    w = X x^{p−k} X x^{p−k+1} X x

where x is a letter and |X| = k.
Proof. Note that w[k + 1..2p] has periods p and p + 1. By Lemma 4 its prefix w[k + 1..p]
(which is empty if k = p) has period 1. Say this is x^{p−k}. Then by the p periodicity
w[p + k + 1..2p] = x^{p−k}. By the p + 1 periodicity w[k + 2p + 2] = w[p + k + 1] = x
and w[2p + 1] = w[p] = x. Let w[1..k] = X so that |X| = k. Then by the p periodicity
w[p + 1..p + k] = X and then, by the p + 1 periodicity, w[2p + 2..2p + k + 1] = X.
Assembling all this gives (1). □
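
Lemma 6 can be sanity-checked by enumeration. The sketch below is an illustration over a binary alphabet for small p, not a substitute for the proof; lemma6_holds is a hypothetical name. It lists every word of length 2p + k + 2 with the two stated periodicities and compares it with X x^{p−k} X x^{p−k+1} X x, taking X to be the first k letters and x the last letter.

```python
from itertools import product


def has_period(f, q):
    """True if f[j] == f[j + q] wherever both positions exist."""
    return all(f[j] == f[j + q] for j in range(len(f) - q))


def lemma6_holds(p, k, alphabet="ab"):
    """Check the conclusion of Lemma 6 for all words of length 2p + k + 2."""
    n = 2 * p + k + 2
    for letters in product(alphabet, repeat=n):
        w = "".join(letters)
        # Hypotheses: w[1..2p] has period p, w[k+1..k+2p+2] has period p + 1
        # (written here with 0-indexed, end-exclusive slices).
        if has_period(w[:2 * p], p) and has_period(w[k:k + 2 * p + 2], p + 1):
            X, x, s = w[:k], w[-1], p - k
            if w != X + x * s + X + x * (s + 1) + X + x:
                return False
    return True


if __name__ == "__main__":
    print(all(lemma6_holds(p, k) for p in range(1, 5) for k in range(p + 1)))
```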

Remark. It is clear that if w[1..2p + 2] has period p + 1 and w[k + 3..2p + k + 2] has
period p then w is the reverse of the right hand side of (1).

Theorem 7. It is not possible for a letter simultaneously to belong to two distinct runs with
period p and two distinct runs of period p + 1.

Proof. The proof is by contradiction. If there exists a counterexample to the theorem
then it has as prefix and suffix two words each of which has a length 2p prefix with
period p and a length 2p + 2 suffix with period p + 1, or vice versa, and is such that the
four periodic factors have at least one letter in common.
Let

    α = xXx \underline{x^s X x^s} X

and

    β = yYy \underline{y^t Y y^t} Y

where

(2)    |X| + s = |Y| + t = p.

By Lemma 6 each of α and β has a prefix of length 2p + 2 with period p + 1 and a
suffix of length 2p and period p. The intersection of these two squares is underlined.
We write R(α) and R(β) for the reverses of α and β. We consider four cases.

Case 1. A word w has prefix α and suffix R(β) with the underlined factors having
non-empty intersection.

Case 2. A word w has prefix R(α) and suffix β with the underlined factors having
non-empty intersection.

Case 3. A word w has prefix α and suffix β with the underlined factors having
non-empty intersection.

Case 4. A word w has prefix R(α) and suffix R(β) with the underlined factors having
non-empty intersection.

If the statement of the Theorem is incorrect then a word belonging to one of these
cases must exist with the stated periods being minimal and the four squares belonging
to four different runs. We will show that in each case this cannot occur.

Case 1. We have

    α = xXx \underline{x^s X x^s} X   and   R(β) = Y \underline{y^t Y y^t} yYy.

Let d be the length of the intersection of these two words. The intersection must have
length less than p else, by Lemma 5, the two period p squares would belong to the
same run of period p. We must also have d > |X| + |Y| else the underlined factors
would not intersect. Recall that |x^s X| = |y^t Y| = p so the intersection is a suffix x^i X of
x^s X and a prefix Y y^j of Y y^t. Thus

    |x^i| + |X| > |X| + |Y|

so i > |Y| which implies that x = y and that each of X and Y is a power of x. It follows that
w is a power of x and the four squares belong to a single run.

Case 2. We have

    R(α) = X \underline{x^s X x^s} xXx   and   β = yYy \underline{y^t Y y^t} Y.

Let d be the length of the intersection of these two words. Now R(α) and β have,
respectively, a suffix with period p + 1 and a prefix with period p + 1. As in Case 1 we
must have d < p + 1. In order that the underlined factors intersect we must also have
d > 4 + |X| + |Y|. The intersection is thus

    x^i X x = y Y y^j

where i + 1 + |X| > |X| + |Y| + 4 implying i > |Y| + 3. As in Case 1 this implies that
x = y and that X and Y are powers of x as is the whole word w.

Case 3. We have

    α = xXx \underline{x^s X x^s} X   and   β = yYy \underline{y^t Y y^t} Y.

This case is more complicated than the others. Let us add another y to the right hand
end of β. Set β' = βy and Y' = Yy. Then

    β' = yY'y \underline{y^{t−1} Y' y^{t−1}} Y'.

This has the same form as β but its underlined factor is one letter shorter and begins
one position further to the right. Suppose we have a word with prefix α and suffix β
in which the underlined factors intersect. By iterating the construction just described
we can arrange that the two underlined factors have an intersection of length one. This
will be the final x in the underlined factor of α and the initial y in the underlined factor
of β. Thus x = y. We suppose, without loss of generality, that this is the case with our
word.

X and Y may have prefixes or suffixes which are powers of x. We set

    X = x^a U x^b
    Y = x^c V x^d

for non-negative integers a, b, c and d, where U and V neither begin nor end with x.
Equations (2) now become

(3)    |U| + a + b + s = |V| + c + d + t = p

and we have

    α = x^{a+1} U x^{b+1} \underline{x^{s+a} U x^{b+s}} x^a U x^b
    β = x^{c+1} V x^{d+1} \underline{x^{t+c} V x^{d+t}} x^c V x^d.

By our assumption the last x in the underlined section of α coincides with the
first x in the underlined section of β. This means that x^{c+1} V x^{d+1} is a suffix of
x^{a+1} U x^{b+1} x^{s+a} U x^{b+s−1} so that U = V and b + s − 1 = d + 1. It also means that
x^a U x^b is a prefix of x^{t+c−1} V x^{d+t} x^c V x^d so that a = c + t − 1. Together this gives
a + b + s − 1 = c + d + t, which contradicts (3).



Case 4. This is just the reverse of Case 3 and need not be separately considered. The 
proof is complete. 

□ 
Theorem 8. For all n we have T(n) < 47n^2/72 + 2n.

Proof. Periods of runs in a word of length n must be less than or equal to n/2.

Consider runs with periods in {2q − 1, 2q} for 1 ≤ q ≤ ⌊n/6⌋. By Theorem 7 no letter
can belong to more than three such runs so the contribution to the TRL is at most 3n
for each such pair, and the contribution from all such pairs is at most 3n⌊n/6⌋.

Now consider runs with periods in {2q − 1, 2q} for ⌊n/6⌋ + 1 ≤ q ≤ ⌈n/4⌉. The
upper bound here ensures that the maximum value of 2q is at least equal to ⌊n/2⌋. For
some values of n we will be counting more runs than we need. The number of pairs
{2q − 1, 2q} is ⌈n/4⌉ − ⌊n/6⌋.

We first show that there can be at most one run in a word of length n for each period
in the set under consideration. Let p be such a period. Then p ≥ 2(⌊n/6⌋ + 1) > n/3.
If we had two runs with period p their intersection would have length at least 4p − n
which is greater than p. This is impossible by Lemma 5. So we have at most one run for
each period p. Suppose there is a run of length x with period 2q − 1 and a run of length
y with period 2q. These have intersection of length at least x + y − n. By Lemma 3 this
must be less than

    2q + (2q − 1) − gcd(2q − 1, 2q) = 4q − 2

else the runs will collapse into a single run with period 1. So x + y ≤ n + 4q − 3. The
contribution from all such pairs to the TRL is at most

    \sum_{q=\lfloor n/6 \rfloor + 1}^{\lceil n/4 \rceil} (n + 4q − 3) = (⌈n/4⌉ − ⌊n/6⌋)(n − 3 + 2(⌈n/4⌉ + ⌊n/6⌋ + 1)).

Adding this to the bound for the shorter periods we see that the TRL is at most

    (⌈n/4⌉ − ⌊n/6⌋)(n − 3 + 2(⌈n/4⌉ + ⌊n/6⌋ + 1)) + 3n⌊n/6⌋.

We can show that this is less than the bound in the theorem by considering values of n
in each residue class modulo 12.

□ 
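
The last step of the proof, comparing the floor and ceiling expression with 47n^2/72 + 2n, can be spot-checked with exact integer arithmetic. The sketch below only tests that final inequality for a range of n; it does not verify the counting argument that leads to the expression. The name upper_expr is hypothetical.

```python
def upper_expr(n):
    """(ceil(n/4) - floor(n/6)) (n - 3 + 2 (ceil(n/4) + floor(n/6) + 1)) + 3 n floor(n/6)."""
    low = n // 6          # floor(n/6)
    high = -(-n // 4)     # ceil(n/4), via integer arithmetic
    return (high - low) * (n - 3 + 2 * (high + low + 1)) + 3 * n * low


def below_theorem_bound(n):
    """Exact test of upper_expr(n) < 47 n^2 / 72 + 2n, multiplied through by 72."""
    return 72 * upper_expr(n) < 47 * n * n + 144 * n


if __name__ == "__main__":
    print(all(below_theorem_bound(n) for n in range(1, 5001)))
    # Largest gap between 72 * expression and 47 n^2, per unit n, by residue mod 12.
    for r in range(12):
        n = 1200 + r
        print(r, (72 * upper_expr(n) - 47 * n * n) / n)
```

Since both sides are quadratic in n with the same leading coefficient 47/72 on each residue class modulo 12, a finite range of n together with the per-residue linear coefficients printed above indicates why the slack of 2n suffices.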

4. The expected value of TRL 
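Theorem 9. The expected TRL for a word of length n on an alphabet of size α is

(4)    \frac{(\alpha-1)^2}{\alpha^2} \sum_{p=1}^{\lfloor (n-2)/2 \rfloor} P(p) \sum_{i=1}^{n-2p-1} \sum_{k=2p}^{n-i-1} \frac{k}{\alpha^k}
       + \frac{2(\alpha-1)}{\alpha} \sum_{p=1}^{\lfloor (n-1)/2 \rfloor} P(p) \sum_{k=2p}^{n-1} \frac{k}{\alpha^k}
       + \frac{n}{\alpha^n} \sum_{p=1}^{\lfloor n/2 \rfloor} P(p),

where P(p) = \sum_{d \mid p} \alpha^{d} \mu(p/d) is the number of length p primitive words on an
alphabet of size α (see [13, Eq. 1.3.7]) and μ is the Möbius function.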

Proof. We count the sum of the TRLs of all words of length n on an alphabet of size α.
We first sum the lengths of those runs which are neither prefixes nor suffixes.

Consider runs of the form x[i + 1..i + k], where 1 ≤ i and i + k < n, which have
period p. For such runs x[1..i − 1] can be any word, so there are α^{i−1} possibilities for
this factor. The letter x[i] must be chosen so that the run does not extend to the left of
x[i + 1]. There are α − 1 such choices. The factor x[i + 1..i + p] is the generator of the
run and can be any primitive word of length p, for which there are P(p) choices. The
rest of the run is then determined by its periodicity. The letter x[i + k + 1] is chosen in
one of α − 1 ways to avoid the run extending to the right. This leaves the final factor
x[i + k + 2..n] which can be chosen in α^{n−i−k−1} ways. The number of words having a
run of the required form is therefore

    α^{i−1}(α − 1)P(p)(α − 1)α^{n−i−k−1} = (α − 1)^2 α^{n−k−2} P(p).
The variable i can take any value from 1 to n − 2p − 1 and, for each such i, k can take
the values 2p to n − i − 1. The length of the run is k so the sum of the lengths of
all runs with period p which are neither prefixes nor suffixes, in all words of length
n, is:

    P(p) \sum_{i=1}^{n-2p-1} \sum_{k=2p}^{n-i-1} (\alpha-1)^2 \alpha^{n-k-2} k
        = (\alpha-1)^2 \alpha^{n-2} P(p) \sum_{i=1}^{n-2p-1} \sum_{k=2p}^{n-i-1} \alpha^{-k} k.

Now consider those runs which are prefixes of x but not suffixes (that is, their length
is less than n). Say x[1..k] is such a run with period p. We have P(p) choices for x[1..p],
which determines x[p + 1..k]; the letter x[k + 1] can be chosen in α − 1 ways and the rest
of the word in α^{n−k−1} ways. The run length k can take any value from 2p to n − 1. The
sum of the lengths of all prefix runs with period p, in all words of length n is:

    P(p) \sum_{k=2p}^{n-1} (\alpha-1) \alpha^{n-k-1} k = (\alpha-1) \alpha^{n-1} P(p) \sum_{k=2p}^{n-1} \alpha^{-k} k.

By symmetry this is also the total for runs which are suffixes but not prefixes. Finally
the number of runs which cover the whole word is just P(p) and these all have length
n. The sum of the lengths of all runs with period p is therefore:

    (\alpha-1)^2 \alpha^{n-2} P(p) \sum_{i=1}^{n-2p-1} \sum_{k=2p}^{n-i-1} \alpha^{-k} k
        + 2(\alpha-1) \alpha^{n-1} P(p) \sum_{k=2p}^{n-1} \alpha^{-k} k + n P(p).

A complication arises here because the maximum period p depends on which of the
four cases we are considering. It is not hard to see that if the run is neither a prefix nor
a suffix then its period is at most ⌊(n − 2)/2⌋, if it is a prefix but not a suffix, or vice
versa, then its period p is at most ⌊(n − 1)/2⌋ and when it is both a prefix and a suffix,
p is at most ⌊n/2⌋. Allowing for these different bounds, summing over p and dividing
through by α^n (the number of words of length n) gives the required formula.

□ 
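
The counting argument can be cross-checked against direct enumeration for small n. The sketch below evaluates the three sums of the theorem with exact rational arithmetic, using the summation limits stated in the proof (i from 1 to n − 2p − 1, k from 2p to n − i − 1, and the three bounds on p), and compares the result with the average TRL over all α^n words. All function names are hypothetical, and the run finder is a brute-force scan.

```python
from fractions import Fraction
from itertools import product


def has_period(f, q):
    return all(f[j] == f[j + q] for j in range(len(f) - q))


def trl(w):
    """Total run length of a sequence w (string or tuple)."""
    total, n = 0, len(w)
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            s = i
            while i + p < n and w[i] == w[i + p]:
                i += 1
            f = w[s:i + p]
            if len(f) >= 2 * p and not any(has_period(f, q) for q in range(1, p)):
                total += len(f)
    return total


def mobius(m):
    """Moebius function by trial division."""
    result, d = 1, 2
    while d * d <= m:
        if m % d == 0:
            m //= d
            if m % d == 0:
                return 0
            result = -result
        d += 1
    return -result if m > 1 else result


def primitive_count(p, alpha):
    """P(p): number of primitive words of length p over alpha letters."""
    return sum(mobius(p // d) * alpha ** d for d in range(1, p + 1) if p % d == 0)


def expected_trl_formula(n, alpha):
    a = Fraction(alpha)
    P = lambda p: primitive_count(p, alpha)
    s1 = sum(P(p) * sum(k / a ** k for i in range(1, n - 2 * p) for k in range(2 * p, n - i))
             for p in range(1, (n - 2) // 2 + 1))
    s2 = sum(P(p) * sum(k / a ** k for k in range(2 * p, n))
             for p in range(1, (n - 1) // 2 + 1))
    s3 = sum(P(p) for p in range(1, n // 2 + 1))
    return (a - 1) ** 2 / a ** 2 * s1 + 2 * (a - 1) / a * s2 + n / a ** n * s3


def expected_trl_bruteforce(n, alpha):
    total = sum(trl(w) for w in product(range(alpha), repeat=n))
    return Fraction(total, alpha ** n)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, expected_trl_formula(n, 2), expected_trl_bruteforce(n, 2))
```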



Let us say that the TRL-density of a word x is TRL(x)/|x|.
Corollary 10. As n tends to infinity the expected density of a word on alphabet size α tends to

    \sum_{p=1}^{\infty} P(p) \frac{2p(\alpha-1)+1}{\alpha^{2p+1}},

where P(p) is as defined in Theorem 9.
Proof. We write S_1(n), S_2(n) and S_3(n) for the three terms in (4), that is,

(5)    S_1(n) = \frac{(\alpha-1)^2}{\alpha^2} \sum_{p=1}^{\lfloor (n-2)/2 \rfloor} P(p) \sum_{i=1}^{n-2p-1} \sum_{k=2p}^{n-i-1} \frac{k}{\alpha^k},

       S_2(n) = \frac{2(\alpha-1)}{\alpha} \sum_{p=1}^{\lfloor (n-1)/2 \rfloor} P(p) \sum_{k=2p}^{n-1} \frac{k}{\alpha^k},

       S_3(n) = \frac{n}{\alpha^n} \sum_{p=1}^{\lfloor n/2 \rfloor} P(p).

We write S(n) for S_1(n) + S_2(n) + S_3(n). We will obtain lim_{n→∞} S(n + 1) − S(n) and
show that this is a finite constant depending only on α. It follows that this limit
equals lim_{n→∞} S(n)/n which is the required expected density. It is easy to show that
lim_{n→∞} S_3(n) equals 0. This is not surprising since S_3(n) counts the contribution to
S(n) from words which are themselves runs. Such words are rare among the α^n words
of length n. It follows that

(6)    lim_{n→∞} S_3(n + 1) − S_3(n) = 0.
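Now consider the term S_2(n). This equals

    \frac{2}{\alpha-1} \sum_{p=1}^{\lfloor (n-1)/2 \rfloor} P(p) \left( \frac{2p(\alpha-1)+1}{\alpha^{2p}} - \frac{n(\alpha-1)+1}{\alpha^{n}} \right).

As n goes to infinity the sum of the second term in the parentheses goes to 0 so we
have

    \lim_{n\to\infty} S_2(n) = \frac{2}{\alpha-1} \sum_{p=1}^{\infty} P(p) \frac{2p(\alpha-1)+1}{\alpha^{2p}}.

Noting that 1 ≤ P(p) ≤ α^p we see that this limit exists and is finite. For α = 2 it equals
7.9100. It follows that

(7)    lim_{n→∞} S_2(n + 1) − S_2(n) = 0.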

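Next consider S_1(n). A change in order of summation gives

    S_1(n) = \frac{(\alpha-1)^2}{\alpha^2} \sum_{i=1}^{n-3} \sum_{k=2}^{n-1-i} \frac{k}{\alpha^k} \sum_{p=1}^{\lfloor k/2 \rfloor} P(p).

Then S_1(n + 1) − S_1(n) equals

    \frac{(\alpha-1)^2}{\alpha^2} \left\{ \sum_{k=2}^{2} \frac{k}{\alpha^k} \sum_{p=1}^{\lfloor k/2 \rfloor} P(p)
        + \sum_{i=1}^{n-3} \frac{n-i}{\alpha^{n-i}} \sum_{p=1}^{\lfloor (n-i)/2 \rfloor} P(p) \right\}.

The first term in the parentheses is the i = n − 2 term in the first sum of the reordered
expression, written for n + 1. The second term corresponds to the terms with k = n − i.
Since P(1) = α for any α this becomes, after changing the index of summation to j = n − i,

    \frac{(\alpha-1)^2}{\alpha^2} \left\{ \frac{2}{\alpha} + \sum_{j=3}^{n-1} \frac{j}{\alpha^j} \sum_{p=1}^{\lfloor j/2 \rfloor} P(p) \right\}.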
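Changing the order of summation again gives

    \frac{(\alpha-1)^2}{\alpha^2} \left\{ \frac{2}{\alpha} + \sum_{p=1}^{\lfloor (n-1)/2 \rfloor} P(p) \sum_{j=\max(3,2p)}^{n-1} \frac{j}{\alpha^j} \right\}.

To simplify matters we start the second sum at j = 2p. This means we are including an
extra term corresponding to j = 2, p = 1. This is 2P(1)/α^2 = 2/α, which equals
the first term in the parentheses and so absorbs it. We thus have

    S_1(n + 1) − S_1(n) = \frac{(\alpha-1)^2}{\alpha^2} \sum_{p=1}^{\lfloor (n-1)/2 \rfloor} P(p) \sum_{j=2p}^{n-1} \frac{j}{\alpha^j}
                        = \frac{1}{\alpha} \sum_{p=1}^{\lfloor (n-1)/2 \rfloor} P(p) \left( \frac{2p(\alpha-1)+1}{\alpha^{2p}} - \frac{n(\alpha-1)+1}{\alpha^{n}} \right).

We now take the limit as n goes to infinity. The second term makes no contribution to
this limit since it is dominated by α^{−n}. So we have

    \lim_{n\to\infty} S_1(n + 1) − S_1(n) = \sum_{p=1}^{\infty} P(p) \frac{2p(\alpha-1)+1}{\alpha^{2p+1}}.

Summing this with (6) and (7) completes the proof.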

□ 

Some values of expected TRL-density are given in Table 2 below, along with
corresponding results for the number of runs and the sum of exponents of runs.

Table 2. The columns show the expected number of runs per unit
length of a word, the expected sum of exponents per unit length and
the expected TRL per unit length for various alphabet sizes. The values
come from [16], [11] and Corollary 10 respectively.

Alphabet size    Runs      Exponents    TRL
      2          0.4116    1.1310       1.9775
      3          0.3049    0.7382       1.0290
      5          0.1933    0.4304       0.5208
     10          0.0991    0.2087       0.2296
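
The TRL column can be recomputed from the series for the limiting density. The sketch below assumes that the limit in Corollary 10 is the sum over p ≥ 1 of P(p)(2p(α − 1) + 1)/α^{2p+1}, which is what summing the geometric tails in the proof gives; the names density and primitive_count and the truncation at 80 terms are choices made for this illustration only.

```python
from fractions import Fraction


def mobius(m):
    """Moebius function by trial division."""
    result, d = 1, 2
    while d * d <= m:
        if m % d == 0:
            m //= d
            if m % d == 0:
                return 0
            result = -result
        d += 1
    return -result if m > 1 else result


def primitive_count(p, alpha):
    """P(p): number of primitive words of length p over alpha letters."""
    return sum(mobius(p // d) * alpha ** d for d in range(1, p + 1) if p % d == 0)


def density(alpha, terms=80):
    """Partial sum of P(p) (2p(alpha - 1) + 1) / alpha^(2p + 1) for p = 1..terms."""
    total = sum(
        Fraction(primitive_count(p, alpha) * (2 * p * (alpha - 1) + 1), alpha ** (2 * p + 1))
        for p in range(1, terms + 1)
    )
    return float(total)


if __name__ == "__main__":
    for alpha in (2, 3, 5, 10):
        print(alpha, round(density(alpha), 4))
```

The terms decay roughly like p/α^p, so 80 terms are far more than needed for four decimal places.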



5. Discussion 
Theorems 2 and 8 show that, for all n > 1,

    1/8 < T(n)/n^2 < 47/72 + 2/n.

Both these bounds might be improved. Lower bounds for ρ(n) and e(n) were obtained
by constructing words which were rich in the appropriate way. The word u(k) of
Theorem 2 is comparatively simple. One could look for something better using the
techniques of [?] or some combinatorial heuristic such as simulated annealing or genetic
algorithms.

The upper bound is probably far from best when n is large, though from Table 1 one
might suspect that the maximum value of T(n)/n^2 occurs when n = 2. We have not
been able to show that lim_{n→∞} T(n)/n^2 exists though this is probably true. Giraud's
method [8] for showing the existence of lim_{n→∞} ρ(n)/n does not seem applicable to
our situation. His method also showed that the limit is the supremum of the function.
In our case it may be the infimum of {T(n)/n^2 : n > 1}. Extending Table 1 might give
insight into these questions.